In [1]:
'''
File name: project.ipynb
Date created: 03/11/2019
Date last modified: 25/11/2019
Python Version: 3.7.4
''';
In [2]:
import pandas as pd
import numpy as np
import geopandas as gpd
import folium
import json
import math
import plotly
import plotly.express as px
import plotly.graph_objects as go

from utils import constants as cst
from utils import clean_database
from utils import areas_handler
from utils import chicago_map_handler as maps

# Set auto-reload 
%load_ext autoreload
%autoreload 2

Load Databases

In this section we load and clean the databases. All methods used are described in utils/clean_database.

Table of Contents

In [3]:
# Load the areas dataframe 
areas_DF = gpd.read_file(cst.AREAS_PATH)

# Clean the dataframe
areas_DF = clean_database.clean_areas_df(areas_DF)
areas_DF.head()
Out[3]:
community_area_num community_area_name shape_area shape_len geometry
0 35 douglas 4.600462e+07 31027.054510 POLYGON ((-87.60914 41.84469, -87.60915 41.844...
1 36 oakland 1.691396e+07 19565.506153 POLYGON ((-87.59215 41.81693, -87.59231 41.816...
2 37 fuller park 1.991670e+07 25339.089750 POLYGON ((-87.62880 41.80189, -87.62879 41.801...
3 38 grand boulevard 4.849250e+07 28196.837157 POLYGON ((-87.60671 41.81681, -87.60670 41.816...
4 39 kenwood 2.907174e+07 23325.167906 POLYGON ((-87.59215 41.81693, -87.59215 41.816...
In [4]:
# Load the food inspections dataframe
food_inspections_DF = pd.read_csv(cst.FOOD_INSPECTIONS_PATH, sep = ',', header = 0, 
                   names = cst.FOOD_INSPECTIONS_COL_NAMES, index_col = None, error_bad_lines=False
                   )

# Clean the dataframe
food_inspections_DF = clean_database.clean_food_inspections_df(food_inspections_DF, areas_DF)
food_inspections_DF.head()
Out[4]:
inspection_id DBA_name AKA_name license_num facility_type risk address city state zip inspection_date inspection_type result violations lat lng location
0 2345323 ARMAND'S PIZZERIA ARMAND'S PIZZERIA 2698587.0 Restaurant Risk 1 (High) 29 N WACKER DR chicago IL 60606.0 2019-11-08T00:00:00.000 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.882700 -87.636638 {'latitude': '-87.63663755997726', 'longitude'...
1 2345321 GOPUFF GOPUFF 2684560.0 Grocery Store Risk 3 (Low) 1801 W WARNER AVE chicago IL 60613.0 2019-11-08T00:00:00.000 License Re-Inspection Pass NaN 41.956846 -87.674395 {'latitude': '-87.6743946694658', 'longitude':...
2 2345325 TACO MAX MEXICAN GRILL TACO MAX MEXICAN GRILL 2699082.0 Restaurant Risk 1 (High) 3402 W MONTROSE AVE chicago IL 60618.0 2019-11-08T00:00:00.000 License Pass w/ Conditions 3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E... 41.961238 -87.713284 {'latitude': '-87.71328438033805', 'longitude'...
3 2345370 CAFE BALLOU CAFE BALLOU & DELI 2433048.0 Restaurant Risk 1 (High) 939 N WESTERN AVE chicago IL 60622.0 2019-11-08T00:00:00.000 Canvass No Entry NaN 41.898706 -87.686773 {'latitude': '-87.68677251748062', 'longitude'...
4 2345376 GARIBAY POULTRY GARIBAY POULTRY 1908500.0 CUSTOM POULTRY SLAUGHTER Risk 2 (Medium) 2100 S CALIFORNIA AVE chicago IL 60608.0 2019-11-08T00:00:00.000 Complaint Pass 47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE... 41.853688 -87.695652 {'latitude': '-87.69565174882821', 'longitude'...
In [5]:
# Load the socio-economic indicators dataframe
socio_economic_DF = pd.read_csv(cst.SOCIO_ECONOMIC_INDICATORS_PATH, sep = ',', header = 0, 
                   names = cst.SOCIO_ECONOMIC_COL_NAMES, index_col = None, error_bad_lines=False
                   )

# Clean the dataframe
socio_economic_DF = clean_database.clean_socio_economic_df(socio_economic_DF)
socio_economic_DF.head()
Out[5]:
community_area_num community_area_name housing_crowded_perc housholds_below_poverty_perc aged_16_or_more_unemployed_perc aged_25_or_more_without_high_school_diploma_perc aged_under_18_or_over_64_perc per_capita_income hardship_idx
0 1 rogers park 7.7 23.6 8.7 18.2 27.5 23939 39.0
1 2 west ridge 7.8 17.2 8.8 20.8 38.5 23040 46.0
2 3 uptown 3.8 24.0 8.9 11.8 22.2 35787 20.0
3 4 lincoln square 3.4 10.9 8.2 13.4 25.5 37524 17.0
4 5 north center 0.3 7.5 5.2 4.5 26.2 57123 6.0
In [6]:
# Load the life expectancy dataframe
life_expectancy_DF = pd.read_csv(cst.LIFE_EXPECTANCY_PATH, sep = ',', header = 0, 
                   names = cst.LIFE_EXPECTANCY_COL_NAMES, index_col = None, error_bad_lines=False
                   )

# Clean the dataframe
life_expectancy_DF = clean_database.clean_socio_economic_df(life_expectancy_DF)
life_expectancy_DF.head()
Out[6]:
community_area_num community_area_name life_exp_1990 lower_95_perc_CI_1990 upper_95_perc_CI_1990 life_exp_2000 lower_95_perc_CI_2000 upper_95_perc_CI_2000 life_exp_2010 lower_95_perc_CI_2010 upper_95_perc_CI_2010
0 1 rogers park 70.9 69.9 71.9 73.1 72.2 74.1 77.3 76.3 78.2
1 2 west ridge 76.9 76.1 77.8 78.1 77.3 78.8 80.3 79.5 81.1
2 3 uptown 64.0 63.1 64.9 71.7 70.8 72.7 76.0 75.1 76.9
3 4 lincoln square 74.2 73.1 75.4 76.8 75.8 77.8 80.5 79.3 81.6
4 5 north center 73.4 72.1 74.7 77.9 76.6 79.1 81.5 80.1 82.8

Complete Datasets

A few issues:

  1. Only the socio-economic and life expectancy dataframes carry the community area name. To compare the dataframes, we therefore add a column with the community area of each food inspection location. Since our analysis is done per area, this is the simplest way around the problem.
  2. Some entries in the food inspections dataframe have no lat/lng pair but do have an address. We look up the lat/lng from the address.

The solutions described are implemented in utils/areas_handler.
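utils/areas_handler is not reproduced here, but the core of issue 1 is a point-in-polygon test: each inspection's (lng, lat) is matched against the community area geometries (in practice one would use geopandas/shapely for this). A minimal self-contained sketch, with hypothetical toy rectangles standing in for the real polygons:

```python
# Sketch of the area-assignment idea, NOT the actual areas_handler code.
# A hand-rolled ray-casting test keeps the example dependency-free.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of (x, y) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray going right from (x, y)
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Toy rectangular "community areas" (hypothetical geometries)
areas = {
    "douglas": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "oakland": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

def area_of(lng, lat):
    """Name of the first area containing the point, or None if outside all."""
    for name, poly in areas.items():
        if point_in_polygon(lng, lat, poly):
            return name
    return None

print(area_of(0.5, 0.5))   # douglas
print(area_of(1.5, 0.5))   # oakland
print(area_of(5.0, 5.0))   # None
```

A real implementation would vectorize this over the whole dataframe, e.g. with a geopandas spatial join, rather than looping point by point.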

Table of Contents

In [7]:
# Filter locations where lng/lat are unknown
food_unknown_loc = food_inspections_DF[food_inspections_DF['lat'].isna()]
In [8]:
# Get unknown locations
unknown_locations = areas_handler.get_unknown_locations(food_unknown_loc)
In [9]:
# Check the locations not found by OpenStreetMaps
unknown_locations[pd.isnull(unknown_locations['lat'])].head()
Out[9]:
address lat lng
9 2011 N GRIFFIN BLVD NaN NaN
10 65 CARMINE ST NaN NaN
11 950 ESTES CT NaN NaN
12 108 W PARK NaN NaN
13 1307 CARDINAL DR NaN NaN
In [10]:
# Display the Chicago areas
# This map contains additional layers to visually check if the found locations are actually within the borders of the city
map_chicago = maps.create_chicago_map()
map_chicago = maps.add_locations(map_chicago, unknown_locations, food_inspections_DF) 
map_chicago
Out[10]:

Description of the map

On the map above we can see the city of Chicago and the region where the facilities from the food_inspections dataset are located. In the following, we will focus on that area and extract some insights.

In [11]:
# Use unknown_locations to fill lat and lng in the original dataframe food_inspections_DF
food_unknown_loc = food_unknown_loc.reset_index().merge(unknown_locations, on="address", how='left') \
                                   .set_index('index').drop(['lat_x', 'lng_x'], axis = 1) \
                                   .rename(columns={'lat_y':'lat','lng_y':'lng'})
food_inspections_DF.update(food_unknown_loc)

# Check which latitudes are still unknown 
print('%d food inspections still missing lat long info, out of %d. '%(food_inspections_DF.lat.isna().sum(), len(food_inspections_DF)))
103 food inspections still missing lat long info, out of 195796. 

Only 103 of the 195,796 inspections still lack coordinates, a negligible fraction, so practically all locations in our food inspections dataframe are now known.
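The fill step above relies on `DataFrame.update`, which aligns on the index and overwrites only with non-NA values from the other frame. A minimal demonstration on toy data:

```python
import pandas as pd

# df has missing latitudes; fixes carries looked-up values aligned on
# df's index (mirroring the reset_index/merge/set_index dance above).
df = pd.DataFrame({"address": ["a", "b", "c"],
                   "lat": [41.8, None, None]})
fixes = pd.DataFrame({"lat": [41.9, None]}, index=[1, 2])

df.update(fixes)   # overwrites only with non-NA values from fixes
print(df["lat"].tolist())   # [41.8, 41.9, nan] -- row 2 stays missing
```

This is why a handful of inspections can remain without coordinates: addresses the geocoder could not resolve leave NaN in the fix frame, and `update` skips them.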

In [12]:
# Resolve area numbers and delete unknown areas
# Takes a while if not already saved
food_inspections_DF = areas_handler.get_locations_with_area(food_inspections_DF, areas_DF)
print("Number of locations: " + str(food_inspections_DF.shape[0]))

# Drop locations not in the city of chicago 
food_inspections_DF = food_inspections_DF.dropna(subset=[cst.AREA_NUM])
print("Number of locations in the city of chicago: " + str(food_inspections_DF.shape[0]))
food_inspections_DF[cst.AREA_NUM] = food_inspections_DF[cst.AREA_NUM].astype(int)
Number of locations: 195796
Number of locations in the city of chicago: 192821

Maps

We visualize some of the data using folium to look for connections.

Table of Contents

Number of inspections map

In [13]:
# Create new dataframe with number of inspections per area
inspection_counts = food_inspections_DF[cst.AREA_NUM].value_counts().to_frame()
inspection_counts.reset_index(level=0, inplace=True)
inspection_counts.columns=[cst.AREA_NUM,'num_inspections']
inspection_counts[cst.AREA_NUM] = inspection_counts[cst.AREA_NUM].astype(str)
inspection_counts.sort_values('num_inspections');
In [14]:
heatmap_chicago = maps.heat_map(inspection_counts, "Inspections", cst.AREA_NUM, 'num_inspections')
heatmap_chicago
Out[14]:

Description of the map

The heatmap above shows the number of inspections in the area we identified earlier. The closer we get to Chicago's city center, the higher the number of inspections.

It is also curious that the number of inspections is higher in the north of Chicago than in the south. This could be due to a higher number of establishments there; we check this below.

Number of Establishments

Here we determine and analyse the number of establishments per community area.

In [15]:
count_DF = food_inspections_DF.copy()
count_DF = count_DF.drop_duplicates(subset=['license_num'])
count_ser = count_DF[cst.AREA_NUM].value_counts().to_frame()
count_ser.reset_index(level=0, inplace=True)
count_ser.columns=[cst.AREA_NUM,'num_establishments']
count_ser[cst.AREA_NUM] = count_ser[cst.AREA_NUM].astype(str)
count_ser.sort_values('num_establishments');
In [16]:
heatmap_chicago = maps.heat_map(count_ser, "Number of Establishments", cst.AREA_NUM, 'num_establishments')
heatmap_chicago
Out[16]:

Description of the map

The heatmap above shows the number of establishments in the area we want to analyze. As before, we observe that the closer we get to the city center or the north of the city, the higher the number of establishments. Indeed, the number of inspections is correlated with the number of establishments: the more establishments an area has, the more inspections it receives.

Average risk per inspection for each area

We compute a "risk_ser" dataframe with two columns: the area number and the average risk. To simplify, we first convert the risk labels in the food inspections dataframe to integers.

In [17]:
def risk_to_num(x):
    if x == 'Risk 3 (Low)':
        return 1
    elif x == 'Risk 2 (Medium)' or x == 'All':
        return 2
    elif x == 'Risk 1 (High)':
        return 3
    else:
        return x

risk_DF = food_inspections_DF.copy()
risk_DF = risk_DF.dropna(subset=['risk'])
risk_DF['risk'] = risk_DF['risk'].apply(risk_to_num).astype(int)
risk_ser = risk_DF.groupby(cst.AREA_NUM)['risk'].mean().to_frame()
risk_ser.reset_index(level=0, inplace=True)
risk_ser[cst.AREA_NUM] = risk_ser[cst.AREA_NUM].astype(str)
In [18]:
heatmap_chicago = maps.heat_map(risk_ser, "Average risk", cst.AREA_NUM, 'risk')
heatmap_chicago
Out[18]:

Description of the map

The heatmap above shows the average risk of the establishments for each community area. The risk of an establishment indicates how strongly it could affect public health: in the original labels, Risk 1 is the highest and Risk 3 the lowest (our numeric encoding reverses this so that a higher value means a higher risk). Furthermore, high-risk establishments are inspected more frequently than low-risk establishments.

Number of inspections per establishment

In [19]:
inspections_per_est = inspection_counts.merge(count_ser, on=cst.AREA_NUM)
inspections_per_est['inspections_per_establishment'] = inspections_per_est['num_inspections'] / inspections_per_est['num_establishments']
In [20]:
heatmap_chicago = maps.heat_map(inspections_per_est, "Number of Inspections per Establishment", cst.AREA_NUM, 'inspections_per_establishment')
heatmap_chicago
Out[20]:

Description of the map

The heatmap above shows the number of inspections per establishment. As highlighted before, the areas whose establishments receive many inspections also tend to show a high average risk.

Violations

Each establishment can receive violation numbers between 1 and 44, or 70, each followed by a specific description of the findings that led to the violation. The higher the number, the less severe the violation. Establishments collecting only high numbers will most likely pass, whereas the others will most likely fail.

An inspection can pass, pass with conditions, or fail depending on these numbers. A 'pass' is given when the establishment is found to have no critical or serious violations. A 'pass with conditions' means critical or serious violations were found but were corrected during the inspection. Finally, a 'fail' is issued when the establishment has critical or serious violations that could not be corrected during the inspection.

We will analyse the violations more deeply in milestone 3.
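Should we later need the numeric codes, the leading number of each violation entry can be pulled out with a small regex. A sketch on a made-up sample, assuming entries in the raw `violations` field are '|'-separated as they appear in the data:

```python
import re

# Hypothetical sample of the 'violations' field
sample = ("3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE... | "
          "47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE...")

def violation_numbers(text):
    """Extract the leading numeric code of each '|'-separated violation entry."""
    nums = []
    for part in text.split('|'):
        m = re.match(r'\s*(\d+)\.', part)
        if m:
            nums.append(int(m.group(1)))
    return nums

print(violation_numbers(sample))  # [3, 47]
```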

Socioeconomic indicators

In [21]:
# Merge socio-economic and life expectancy df's on the area number and names
socio_life_merged_DF = socio_economic_DF.merge(life_expectancy_DF, how="left", on=["community_area_num", "community_area_name"])
In [22]:
# Select two of the following columns for the heatmap
socio_life_merged_DF[cst.AREA_NUM] = socio_life_merged_DF[cst.AREA_NUM].astype(str)
for c in socio_life_merged_DF.columns:
    print(c)
community_area_num
community_area_name
housing_crowded_perc
housholds_below_poverty_perc
aged_16_or_more_unemployed_perc
aged_25_or_more_without_high_school_diploma_perc
aged_under_18_or_over_64_perc
per_capita_income
hardship_idx
life_exp_1990
lower_95_perc_CI_1990
upper_95_perc_CI_1990
life_exp_2000
lower_95_perc_CI_2000
upper_95_perc_CI_2000
life_exp_2010
lower_95_perc_CI_2010
upper_95_perc_CI_2010
In [23]:
heatmap_chicago = maps.heat_map(socio_life_merged_DF, "Per capita income 2010",cst.AREA_NUM ,'per_capita_income', good_indicator = True)
heatmap_chicago
Out[23]:
In [24]:
heatmap_chicago = maps.heat_map(socio_life_merged_DF, "Life expectancy 2010",cst.AREA_NUM ,'life_exp_2010', good_indicator = True)
heatmap_chicago
Out[24]:

We can compare the map above, which displays life expectancies, with the percentage of households below the poverty line.

In [25]:
heatmap_chicago = maps.heat_map(socio_life_merged_DF, "Households below poverty line (%)", cst.AREA_NUM, 'housholds_below_poverty_perc')
heatmap_chicago
Out[25]:

In the first map, some community areas have no entry for life expectancy. Clearly, life expectancies are higher near the center of Chicago and on its north side. These are also the regions with the smallest share of households below the poverty line. Around West Garfield Park, life expectancy drops drastically and the share of households below the poverty line rises.

As a final conclusion:

The center and north of Chicago mostly have higher life expectancies and lower percentages of households below the poverty line.

With respect to food inspections, these areas also have more restaurants, and more food inspections per restaurant, than the south.

Statistics :

Socioeconomic indicators

We report some statistics on the various dataframes.

Table of Contents

In [26]:
# Check: range, mean, and spread of the main indicators.
important_columns = ['community_area_num', 'community_area_name', 'housing_crowded_perc',
       'housholds_below_poverty_perc', 'aged_16_or_more_unemployed_perc',
       'aged_25_or_more_without_high_school_diploma_perc',
       'aged_under_18_or_over_64_perc', 'per_capita_income', 'hardship_idx',
       'life_exp_2010']

socio_life_merged_DF[important_columns].describe()
Out[26]:
housing_crowded_perc housholds_below_poverty_perc aged_16_or_more_unemployed_perc aged_25_or_more_without_high_school_diploma_perc aged_under_18_or_over_64_perc per_capita_income hardship_idx life_exp_2010
count 78.000000 78.000000 78.000000 78.000000 78.000000 78.000000 77.000000 73.000000
mean 4.920513 21.739744 15.341026 20.330769 35.717949 25597.000000 49.506494 77.597260
std 3.658981 11.457231 7.499497 11.746514 7.284421 15196.405541 28.690556 4.130879
min 0.300000 3.300000 4.700000 2.500000 13.500000 8201.000000 1.000000 68.800000
25% 2.325000 13.350000 9.200000 12.075000 32.150000 15804.750000 25.000000 74.400000
50% 3.850000 19.050000 13.850000 18.650000 38.050000 21668.500000 50.000000 79.500000
75% 6.800000 29.150000 20.000000 26.600000 40.500000 28715.750000 74.000000 80.500000
max 15.800000 56.500000 35.900000 54.800000 51.500000 88669.000000 98.000000 85.200000

Above we see the mean, standard deviation and quartiles for the columns that highlight the socio-economic factors of all areas. In the database all community areas are weighted equally, so this does not take into account the size of an area or the number of people living in it. It is still interesting to analyse this result to get some insight on what we are dealing with. For example, it is surprising that the standard deviation of per capita income is more than half its mean, or that life expectancy across areas has a standard deviation of more than 4 years.

In what follows we analyse more closely the correlations between the different features in this dataframe.

In [27]:
corr = socio_life_merged_DF[cst.SOCIOECONOMIC_METRICS].corr()
corr
Out[27]:
housing_crowded_perc housholds_below_poverty_perc aged_16_or_more_unemployed_perc aged_25_or_more_without_high_school_diploma_perc per_capita_income hardship_idx life_exp_2010 aged_under_18_or_over_64_perc
housing_crowded_perc 1.000000 0.319403 0.165299 0.875959 -0.541730 0.649574 -0.044064 0.224692
housholds_below_poverty_perc 0.319403 1.000000 0.800084 0.424294 -0.567025 0.803267 -0.691029 0.435894
aged_16_or_more_unemployed_perc 0.165299 0.800084 1.000000 0.355518 -0.656619 0.792294 -0.797766 0.676532
aged_25_or_more_without_high_school_diploma_perc 0.875959 0.424294 0.355518 1.000000 -0.709770 0.802538 -0.136151 0.408878
per_capita_income -0.541730 -0.567025 -0.656619 -0.709770 1.000000 -0.849167 0.566589 -0.754844
hardship_idx 0.649574 0.803267 0.792294 0.802538 -0.849167 1.000000 -0.616442 0.690844
life_exp_2010 -0.044064 -0.691029 -0.797766 -0.136151 0.566589 -0.616442 1.000000 -0.566358
aged_under_18_or_over_64_perc 0.224692 0.435894 0.676532 0.408878 -0.754844 0.690844 -0.566358 1.000000

We now want to make sure that the correlation between any two metrics that are both good or both bad is always positive, and that the correlation between a good and a bad metric is always negative. For this, we create the boolean variable sign_kept: it remains true only if the above condition is always met.

In [28]:
bad_metrics = set(['housing_crowded_perc', 'housholds_below_poverty_perc', 'aged_16_or_more_unemployed_perc', 
               'aged_25_or_more_without_high_school_diploma_perc', 'hardship_idx', 'aged_under_18_or_over_64_perc'])
good_metrics = set(['per_capita_income', 'life_exp_2010' ])
sign_kept = True

for c1 in cst.SOCIOECONOMIC_METRICS:
    for c2 in cst.SOCIOECONOMIC_METRICS:
        if (c1 in bad_metrics and c2 in bad_metrics) or (c1 in good_metrics and c2 in good_metrics):
            if corr[c1][c2] < 0:
                sign_kept = False
        elif (c1 in bad_metrics and c2 in good_metrics) or (c1 in good_metrics and c2 in bad_metrics):
            if corr[c1][c2] > 0:
                sign_kept = False
                
if sign_kept: 
    print('The correlation between indicators both good or both bad is always positive,\n and the correlation between a good and a bad indicator is always negative')
The correlation between indicators both good or both bad is always positive,
 and the correlation between a good and a bad indicator is always negative
In [29]:
# Set the correlation of each variable with itself to NaN in order to ignore it later
for c in corr.columns:
    corr.loc[c, c] = None
    
corrmax = pd.DataFrame(corr.idxmax()).rename({0: 'Strongest positive correlation'}, axis = 1)
corrmax['Correlation value'] = corr.max()
corrmax
Out[29]:
Strongest positive correlation Correlation value
housing_crowded_perc aged_25_or_more_without_high_school_diploma_perc 0.875959
housholds_below_poverty_perc hardship_idx 0.803267
aged_16_or_more_unemployed_perc housholds_below_poverty_perc 0.800084
aged_25_or_more_without_high_school_diploma_perc housing_crowded_perc 0.875959
per_capita_income life_exp_2010 0.566589
hardship_idx housholds_below_poverty_perc 0.803267
life_exp_2010 per_capita_income 0.566589
aged_under_18_or_over_64_perc hardship_idx 0.690844
In [30]:
corrmin = pd.DataFrame(corr.idxmin()).rename({0: 'Strongest negative correlation'}, axis = 1)
corrmin['Correlation value'] = corr.min()
corrmin
Out[30]:
Strongest negative correlation Correlation value
housing_crowded_perc per_capita_income -0.541730
housholds_below_poverty_perc life_exp_2010 -0.691029
aged_16_or_more_unemployed_perc life_exp_2010 -0.797766
aged_25_or_more_without_high_school_diploma_perc per_capita_income -0.709770
per_capita_income hardship_idx -0.849167
hardship_idx per_capita_income -0.849167
life_exp_2010 aged_16_or_more_unemployed_perc -0.797766
aged_under_18_or_over_64_perc per_capita_income -0.754844

From the correlations above we notice several things. Firstly, we can classify the indicators as good (life expectancy and per capita income) or bad (percentage of crowded housing, percentage of households below the poverty line, percentage of unemployed people aged 16 or more, percentage of people aged 25 or more without a high school diploma, the hardship index, and the percentage of people under 18 or over 64). The correlation between two indicators that are both good or both bad is always positive, whereas the correlation between a good and a bad indicator is always negative (sign_kept).

We also notice that the percentage of people under 18 or over 64 is a strong negative indicator: it is more negatively correlated with per capita income than, for example, the percentage of households below the poverty line.

It is indeed quite surprising that per capita income is not more strongly correlated with the percentage of households below the poverty line (correlation -0.57). We plot the two metrics to investigate.

One reason the linear correlation is low is that the relationship looks exponential rather than linear. We also notice that the top 5 neighbourhoods by per capita income are not among the 15 with the lowest percentage of households below the poverty line.
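A quick sanity check of the "exponential relationship" remark, on synthetic data rather than the Chicago figures: if income decayed exponentially with the poverty rate, the Pearson correlation on raw values would understate the (perfectly monotone) relationship, while correlating against log(income) would recover it.

```python
import numpy as np

# Synthetic illustration (hypothetical numbers, not the Chicago data)
rng = np.random.default_rng(0)
poverty = rng.uniform(0, 50, 200)
income = 90000 * np.exp(-0.05 * poverty)      # noiseless exponential decay

raw = np.corrcoef(poverty, income)[0, 1]       # weaker than -1 due to curvature
logged = np.corrcoef(poverty, np.log(income))[0, 1]
print(round(raw, 3), round(logged, 3))         # logged is -1.0: log linearizes the decay
```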

We now plot per capita income (y-axis) against the percentage of households below the poverty line (x-axis) for every community area. The plot is interactive: hover over a point to see its community area name and the precise values.

In [33]:
plotly.offline.init_notebook_mode(connected=True)
In [34]:
fig = px.scatter(socio_life_merged_DF, x = 'housholds_below_poverty_perc', y = 'per_capita_income', \
                hover_data=['community_area_name'])
fig.update_layout(
    xaxis_title="percentage of households below poverty line",
    yaxis_title="per capita income",
    title=go.layout.Title(text="Community area per capita income with respect to households below the poverty line"),
    )
fig.show()
fig.show()
In [ ]: